SEAME: a Mandarin-English code-switching speech corpus in South-East Asia

Authors

Dau-Cheng Lyu

Tien Ping Tan

Chng Eng Siong

Haizhou Li

Abstract

In Singapore and Malaysia, people often mix Mandarin and English within a single sentence. We call such sentences intra-sentential code-switch sentences. In this paper, we report on the development of SEAME, a spontaneous Mandarin-English code-switching speech corpus. The corpus is developed as part of a multilingual speech recognition project and will be used to examine how Mandarin-English code-switching occurs in spoken language in South-East Asia. Additionally, it can provide insights into the development of large vocabulary continuous speech recognition (LVCSR) for code-switching speech. The collected corpus consists of intra-sentential code-switching utterances recorded in both interview and conversational settings. This paper describes the corpus design and the analysis of the collected corpus.

Related resources

An Analysis of a Mandarin-English Code-switching Speech Corpus: SEAME

SEAME (South East Asia Mandarin-English) is a 30-hour spontaneous Mandarin-English code-switching speech corpus recorded from Singaporean and Malaysian speakers. In this paper, we report a series of analyses on the recording, processing time and voice activity rate (VAR) of the speech recording, transcription, validation and language-boundary labeling processes. In addition, the duration of the...

Features for factored language models for code-Switching speech

This paper investigates features that can be used to predict code-switching speech. For this task, factored language models are applied and integrated into a state-of-the-art decoder. Different possible factors, such as words, part-of-speech tags, Brown word clusters, open class words and open class word clusters are explored. We find that Brown word clusters, part-of-speech tag...

A Mandarin-English Code-Switching Corpus

Generally, existing monolingual corpora are not suitable for large vocabulary continuous speech recognition (LVCSR) of code-switching speech. The motivation of this paper is to study the rules and constraints that code-switching follows and to design a corpus for the code-switching LVCSR task. This paper presents the development of a Mandarin-English code-switching corpus. This corpus consists of four pa...

Functions of Code-Switching Strategies among Iranian EFL Learners and Their Speaking Ability Improvement through Code-Switching

This study investigated the impact of code-switching on the speaking ability of Iranian low-proficiency EFL learners. Moreover, it was an attempt to show what functions lay behind the code-switching strategies used by the EFL learners. To this end, 60 male and female Iranian EFL learners aged between 20 and 30 participated in the study. Data collection instruments which were used were the Int...

Synthesising isiZulu-English Code-Switch Bigrams Using Word Embeddings

Code-switching is prevalent among South African speakers, and presents a challenge to automatic speech recognition systems. It is predominantly a spoken phenomenon, and generally does not occur in textual form. Therefore a particularly serious challenge is the extreme lack of training material for language modelling. We investigate the use of word embeddings to synthesise isiZulu-to-English cod...

Publication date: 2010
